Add Intl.Segmenter support #539

ExplodingCabbage · 2024-08-01T12:24:27Z

Resolves #438; probably also adequately resolves #214 even though it's not quite the solution that was asked for in that issue.


fisker · 2025-05-16T14:39:28Z

test/diff/word.js

+      // 2. "Mei (梅) has (有) many (很多) sons (儿子)"
+      // We want to see that diffWords will get the word counts right and won't try to treat the
+      // trailing 子 as common to both texts (since it's part of a different word each time).
+      // TODO: Check with a Chinese speaker that this example is correct Chinese.
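The intent of this test comment can be illustrated without jsdiff itself. What follows is only a minimal sketch, assuming a segmenter built with `new Intl.Segmenter('zh', { granularity: 'word' })`; the locale tag and the helper name `wordTokens` are assumptions, not taken from the test file. It contrasts the characters the two sentences share (which include 子) with the word-level tokens they share. Token boundaries depend on the ICU data shipped with the runtime, so no output is shown.

```javascript
// Assumption: 'zh' locale with word granularity; the PR's test may build its segmenter differently.
const chineseSegmenter = new Intl.Segmenter('zh', { granularity: 'word' });

// Hypothetical helper: split text into the segments reported by the segmenter.
function wordTokens(text, segmenter) {
  return Array.from(segmenter.segment(text), ({ segment }) => segment);
}

const oldText = '我有很多桌子。';
const newText = '梅有很多儿子。';

const oldTokens = wordTokens(oldText, chineseSegmenter);
const newTokens = wordTokens(newText, chineseSegmenter);

// Characters that occur in both sentences (this includes 子).
const sharedChars = [...new Set(oldText)].filter((ch) => newText.includes(ch));

// Word-level tokens that occur in both sentences.
const sharedTokens = oldTokens.filter((token) => newTokens.includes(token));

console.log({ oldTokens, newTokens, sharedChars, sharedTokens });
```

In the PR itself the segmenter is handed to `diffWords` through the `intlSegmenter` option rather than compared by hand like this.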


Hello @ExplodingCabbage, I'm Chinese, and I can confirm that the meaning of these two sentences is correct.

But I'm not sure about the purpose of the test here.

I can see that:

> [...chineseSegmenter.segment('我有很多桌子。')].map(({segment})=>segment)
[ '我有', '很多', '桌子', '。' ]
> [...chineseSegmenter.segment('梅有很多儿子。')].map(({segment})=>segment)
[ '梅', '有', '很多', '儿子', '。' ]

I'm not sure why '我有' is kept together while '梅' and '有' are separated. If you want to test something similar, you can change 梅有 to 他有 (He has) or 她有 (She has).

It's quite strange...

> [...chineseSegmenter.segment('她有很多桌子。')].map(({segment})=>segment)
[ '她', '有', '很多', '桌子', '。' ]
> [...chineseSegmenter.segment('他有很多桌子。')].map(({segment})=>segment)
[ '他有', '很多', '桌子', '。' ]

'她' and '他' are the same word apart from gender: '他' is male (he) and '她' is female (she).
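To reproduce this comparison on another runtime, a small sketch like the following may help. It assumes `chineseSegmenter` is `new Intl.Segmenter('zh', { granularity: 'word' })`, which the comment does not state, and it uses the 桌子 sentence for every subject. The segmentation comes from the ICU data bundled with the JavaScript engine, so results may differ from the output quoted above.

```javascript
// Assumption: the comment does not show how chineseSegmenter was constructed.
const chineseSegmenter = new Intl.Segmenter('zh', { granularity: 'word' });

// Subjects tried in the comment: 我 (I), 梅 (Mei), 他 (he), 她 (she).
const subjects = ['我', '梅', '他', '她'];

const results = subjects.map((subject) => {
  const sentence = `${subject}有很多桌子。`;
  const segments = Array.from(chineseSegmenter.segment(sentence), (s) => s.segment);
  // True when the subject and 有 were reported as a single segment.
  const mergedWithVerb = segments[0] === `${subject}有`;
  return { sentence, segments, mergedWithVerb };
});

for (const { sentence, segments, mergedWithVerb } of results) {
  console.log(sentence, segments, mergedWithVerb ? 'merged' : 'separate');
}
```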

ExplodingCabbage added 5 commits August 1, 2024 13:23

Add Intl.Segmenter support and some initial tests. (Missing docs, coverage, release notes.)

b09babd

Get to 100% coverage

714de4c

Document intlSegmenter

b998ddd

Improve docs

bd052a3

Add release notes

e703307

ExplodingCabbage marked this pull request as ready for review August 1, 2024 12:26

ExplodingCabbage merged commit 4f0430a into master Aug 1, 2024

ExplodingCabbage deleted the intl.segmenter branch August 1, 2024 12:30

ryota-ka added a commit to ryota-ka/DefinitelyTyped that referenced this pull request Oct 1, 2024

diffWords now takes an optional intlSegmenter option

9d569b5

kpdecker/jsdiff#539

ryota-ka added a commit to ryota-ka/DefinitelyTyped that referenced this pull request Oct 8, 2024

diffWords now takes an optional intlSegmenter option

5fad2b8

kpdecker/jsdiff#539

ryota-ka added a commit to ryota-ka/DefinitelyTyped that referenced this pull request Oct 8, 2024

diffWords now takes an optional intlSegmenter option

b9b1798

kpdecker/jsdiff#539

kibanamachine mentioned this pull request Jan 6, 2025

[8.x] Update @elastic/kibana-data-discovery dependencies (main) (#202622) elastic/kibana#205647

Merged

fisker reviewed May 16, 2025
